{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Central Limit Theorem\n",
    "\n",
    "In this section, we are going to show you how the central limit theorem works. We'll:\n",
    "* generate random samples from a population\n",
    "* take the means of the samples\n",
    "* visualize the resulting means\n",
    "\n",
    "You'll see that although the population does not follow a Gaussian distribution, the resulting distribution of the sample means does look Gaussian.\n",
    "\n",
    "To start things off, run the code cell below. This cell will run a helper function that creates the population data, then visualizes the population data and calculates the mean of the population data. There are 10,000 data points in the population. \n",
    "\n",
    "If you run the cell multiple times, you'll notice that the distribution changes slightly; however, the general shape remains the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import helpers\n",
    "import numpy as np\n",
    "%matplotlib inline\n",
    "\n",
    "population_data = helpers.distribution(50, 10000, 100)\n",
    "helpers.histogram_visualization(population_data)\n",
    "print('population mean ', np.mean(population_data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Taking Samples from the Population\n",
    "\n",
    "The next code cell will randomly choose N data points from the population. These N data points will be called a sample. We are using the numpy library's random.choice method to randomly choose N values, which you can read about [here](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.choice.html).\n",
    "\n",
    "Run the code cell below to see some example output. The code randomly draws 10 data points from the population to make a sample of size 10."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def random_sample(population_data, sample_size):\n",
    "    return np.random.choice(population_data, size = sample_size)\n",
    "\n",
    "random_sample(population_data, 10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Calculating the Sample Mean\n",
    "\n",
    "We'll next use the numpy library to calculate the mean of each randomly generated sample. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sample_mean(sample):\n",
    "    return np.mean(sample)\n",
    "\n",
    "# take a sample from the population\n",
    "example_sample = random_sample(population_data, 10)\n",
    "\n",
    "# calculate the mean of the sample and output the results\n",
    "sample_mean(example_sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Central Limit Theorem Results\n",
    "\n",
    "Now, we'll use the random_sample() function and the sample_mean() function to show how the central limit theorem works.\n",
    "\n",
    "The code below contains a for loop that draws a random sample of size N, then takes the mean of the sample, and stores the means in a list. Each iteration of the for loop will have a different random sample. Study the code below and then run the cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "###\n",
    "# Code for showing how the central limit theorem works.\n",
    "# The function inputs:\n",
    "# population - population data\n",
    "# n - sample size\n",
    "# iterations - number of times to draw random samples\n",
    "\n",
    "def central_limit_theorem(population, n, iterations):\n",
    "    sample_means_results = []\n",
    "    for i in range(iterations):\n",
    "        # get a random sample from the population of size n\n",
    "        sample = random_sample(population, n)\n",
    "        \n",
    "        # calculate the mean of the random sample \n",
    "        # and append the mean to the results list\n",
    "        sample_means_results.append(sample_mean(sample))\n",
    "    return sample_means_results\n",
    "\n",
    "print('Means of all the samples ')\n",
    "central_limit_theorem(population_data, 10, 10000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualize the Results - Sample Size = 30\n",
    "\n",
    "The next cell calculates the means of ten-thousand samples each of size 30 and then visualizes the sample means using a histogram. Notice that the results roughly have the shape of a Gaussian distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def visualize_results(sample_means):\n",
    "\n",
    "    plt.hist(sample_means, bins = 30)\n",
    "    plt.title('Histogram of the Sample Means')\n",
    "    plt.xlabel('Mean Value')\n",
    "    plt.ylabel('Count')\n",
    "\n",
    "# Take random sample and calculate the means\n",
    "sample_means_results = central_limit_theorem(population_data, 30, 10000)\n",
    "\n",
    "# Visualize the results\n",
    "visualize_results(sample_means_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we started off with a population that definitely did not have a Gaussian distribution. But by taking samples of the distribution and calculating the sample means, we end up with something that looks like a Gaussian distribution."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualize the Results - Sample Size = 1\n",
    "\n",
    "One part of the central limit theorem states that the sample size needs to be large enough. A general rule of thumb is that the sample size should be greater than or equal to 30. Let's try using different sample sizes to see what the results look like.\n",
    "\n",
    "An exagerrated case would be to use a sample size of 1. This should give us a similar distribution to the original population. Run the code below to see the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Take random sample and calculate the means\n",
    "sample_means_results = central_limit_theorem(population_data, 1, 10000)\n",
    "\n",
    "# Visualize the results\n",
    "visualize_results(sample_means_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualize the Results - Sample Size = 10\n",
    "\n",
    "Now, let's use the minimum recommended sample size of 30 and see what happens.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Take random sample and calculate the means\n",
    "sample_means_results = central_limit_theorem(population_data, 10, 10000)\n",
    "\n",
    "# Visualize the results\n",
    "visualize_results(sample_means_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With a sample size of 10, the distribution of the sample means does look somewhat Gaussian."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualize the Results - Sample Size = 1000\n",
    "\n",
    "Let's go even further and use a larger sample size: in this case 1000.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Take random sample and calculate the means\n",
    "sample_means_results = central_limit_theorem(population_data, 1000, 10000)\n",
    "\n",
    "# Visualize the results\n",
    "visualize_results(sample_means_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This result not only looks Gaussian, but you can see another trend. The spread of the data (ie the standard deviation) is decreasing as the sample size increases."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualize the Results - Sample Size = 10000\n",
    "\n",
    "What happens if the sample size equals the population size? Because we're taking [random samples with replacement](http://stattrek.com/statistics/dictionary.aspx?definition=Sampling_with_replacement), it's very unlikely that one of the samples would be exactly the population; however, the standard deviation should decrease even further since each sample will likely be similar to the population."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Take random sample and calculate the means\n",
    "sample_means_results = central_limit_theorem(population_data, 10000, 10000)\n",
    "\n",
    "# Visualize the results\n",
    "visualize_results(sample_means_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "# Conclusion \n",
    "Notice as well that the center of these distributions is near the original population mean.\n",
    "\n",
    "Think about collecting data in the real world. If you wanted to find the distribution of human height around the world, you could measure every single person's height and analyze the results. If you took the mean of your results, you would have the true average of human height; however, measuring the entire world population is not feasible. \n",
    "\n",
    "Instead, you could take a sample of heights. If you only measured thirty people, your sample means is likely to be far from the population mean. However, if you measured 2 billion randomly chosen people, the sample mean will probably be closer to the population mean. The larger your sample, the more likely your sample mean will match the true population mean."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
